Introduction


Team information

Name NetID Email
Artsiom Strok astrok2
Lin Jiang linj3
Mayank Chhablani mchhab2

Description of the dataset

Life expectancy has increased dramatically over the last few centuries. Since 1900 the global average life expectancy has more than doubled and there has been a huge development in health sector in the past 15 years resulting in improvement of human mortality rates especially in the developing nations in comparison to the past 30 years. Therefore, in this project, only data from year 2000-2015 is considered for exploration and analysis.

This dataset is a compilation of data from the Global Health Observatory (GHO) and United Nations. The GHO data repository is WHO’s gateway to health-related statistics which provides access to a variety of indicators on priority health topics including mortality and burden of diseases, environmental health, violence and injuries etc. (http://apps.who.int/gho/data/node.resources). The economic data such as GDP is collected from the National Accounts Main Aggregates Database under United Nations which collects and disseminates economic statistics from countries worldwide (https://unstats.un.org/unsd/snaama/Index).

This dataset is cleaned by removing some missing values, maily for population, Hepatitis B and GDP from less known countries and shared on Kaggle website (https://www.kaggle.com/kumarajarshi/life-expectancy-who). The final dataset contains 2938 observations with 22 variables which are more critical and representative among all the categories of health-related factors from year 2000 - 2015 for 193 countries.

The description of each variable for this dataset is listed below:

Numerical Response

  • Life expectancy: Life Expectancy in age

Numerical Predictors

  • Year: Year
  • Adult Mortality: Adult Mortality Rates of both sexes (probability of dying between 15 and 60 years per 1000 population)
  • infant: deathsNumber of Infant Deaths per 1000 population
  • Alcohol: Alcohol, recorded per capita (15+) consumption (in litres of pure alcohol)
  • percentage expenditure: Expenditure on health as a percentage of Gross Domestic Product per capita(%)
  • Hepatitis B: Hepatitis B (HepB) immunization coverage among 1-year-olds (%)
  • Measles: Measles - number of reported cases per 1000 population
  • BMI: Average Body Mass Index of entire population
  • under-five deaths: Number of under-five deaths per 1000 population
  • Polio: Polio (Pol3) immunization coverage among 1-year-olds (%)
  • Total expenditure: General government expenditure on health as a percentage of total government expenditure (%)
  • Diphtheria: Diphtheria tetanus toxoid and pertussis (DTP3) immunization coverage among 1-year-olds (%)
  • HIV/AIDS: Deaths per 1 000 live births HIV/AIDS (0-4 years)
  • GDP: Gross Domestic Product per capita (in USD)
  • Population: Population of the country
  • thinness 1-19 years: Prevalence of thinness among children and adolescents for Age 10 to 19 (% )
  • thinness 5-9 years: Prevalence of thinness among children for Age 5 to 9(%)
  • Income composition of resources: Human Development Index in terms of income composition of resources (index ranging from 0 to 1)
  • Schooling: Number of years of Schooling(years)

Categotical Predictors

  • Country: Country
  • Status: Developed or Developing status

Background information

The GHO data repository is WHO’s gateway to health-related statistics which provides access to a variety of indicators on priority health topics including mortality and burden of diseases, environmental health, violence and injuries etc. (http://apps.who.int/gho/data/node.resources).

The economic data such as GDP is collected from the National Accounts Main Aggregates Database under United Nations which collects and disseminates economic statistics from countries worldwide (https://unstats.un.org/unsd/snaama/Index).

By combining the data from these two databases and exploring the interactions between the variables such as immunization, moratlity, education and economic factors, we expect to build a robust multiple linear regression model to predict the life expectancy more accurately.

Statement of interest

For this project, we would like to:

  • identify the major factors that affect the life expectancy
  • explore the interactions between the variables such as immunization, moratlity, education, economic factors etc.
  • discover the corelation between different factors
  • perform residual and outlier diagnostics to exclude some influential points if any
  • build a robust multiple linear regression model to predict the life expectancy

Samples of the data

library(knitr)
data = read.csv("Life Expectancy Data.csv")
kable(t(data[sample(nrow(data), 5), ]))
2207 1634 2353 981 1624
Country Samoa Mali Slovenia Georgia Mali
Year 2009 2000 2008 2013 2010
Status Developing Developing Developed Developing Developing
Life.expectancy 76.0 49.8 78.9 74.5 56.5
Adult.Mortality 172 37 96 128 273
infant.deaths 0 60 0 1 54
Alcohol 2.88 0.47 10.94 5.91 0.60
percentage.expenditure 42.124 23.945 392.959 180.379 9.586
Hepatitis.B 44 NA NA 96 72
Measles 0 1578 0 7872 1719
BMI 72.0 15.6 55.8 54.4 2.7
under.five.deaths 0 111 0 1 91
Polio 45 53 97 94 77
Total.expenditure 5.40 6.29 8.47 7.25 6.35
Diphtheria 49 43 97 93 73
HIV.AIDS 0.1 2.5 0.1 0.1 1.5
GDP 335.11 269.35 2751.81 4274.38 78.38
Population 184826 196769 221316 3776 157585
thinness..1.19.years 0.2 11.0 1.7 2.7 8.8
thinness.5.9.years 0.2 1.9 1.7 2.8 8.5
Income.composition.of.resources 0.691 0.291 0.869 0.755 0.396
Schooling 12.9 4.4 16.9 13.5 7.3

Methods

Required packages:

  • DAAG
  • corrplot
  • ggplot2
  • magrittr
  • knitr
  • tidyr
  • GGally
  • faraway
  • lmtest
  • MASS

Data analysis:

original_data = read.csv("Life Expectancy Data.csv")
kable(t(original_data[sample(nrow(original_data), 5), ]))
1380 2304 1125 1029 2525
Country Kiribati Sierra Leone Haiti Greece Switzerland
Year 2014 2009 2013 2013 2012
Status Developing Developing Developing Developing Developed
Life.expectancy 66.1 47.1 62.7 86.0 82.7
Adult.Mortality 2 433 253 74 54
infant.deaths 0 28 14 0 0
Alcohol 0.01 3.97 5.68 7.46 9.86
percentage.expenditure 97.87 49.84 4.99 2183.11 18379.33
Hepatitis.B 75 84 68 98 NA
Measles 0 31 0 3 61
BMI 77.1 21.2 47.7 65.4 56.2
under.five.deaths 0 42 19 0 0
Polio 79 81 67 99 96
Total.expenditure 1.21 13.13 8.10 9.26 11.59
Diphtheria 75 84 68 99 96
HIV.AIDS 0.1 1.7 0.5 0.1 0.1
GDP 1684.54 394.59 81.27 21874.82 83164.39
Population 11458 63126 1431776 1965211 7996861
thinness..1.19.years 0.1 8.5 3.9 0.8 0.5
thinness.5.9.years 0.1 8.4 3.9 0.7 0.3
Income.composition.of.resources 0.597 0.375 0.483 0.860 0.932
Schooling 11.9 8.5 9.1 17.1 15.7
kable(sort(colSums(is.na(original_data)), decreasing = TRUE), col.names = "Number of missing values")
Number of missing values
Population 652
Hepatitis.B 553
GDP 448
Total.expenditure 226
Alcohol 194
Income.composition.of.resources 167
Schooling 163
BMI 34
thinness..1.19.years 34
thinness.5.9.years 34
Polio 19
Diphtheria 19
Life.expectancy 10
Adult.Mortality 10
Country 0
Year 0
Status 0
infant.deaths 0
percentage.expenditure 0
Measles 0
under.five.deaths 0
HIV.AIDS 0

1289 samples have at least one missing value. Alcohol is missing in 194 samples all of them belongs to 2015 year and these countiries definately should have alcohol consuption more than 0. The data was collected in 2015 when data about alcohol consumption simply was not available.

Life expextancy and adult mortality is missing for 10 samples in 2013, all of them belongs to islands.

Hepatitis B is missing in 553 samples. Samples belongs to different countries and years.

nrow(original_data)
## [1] 2938
data = na.omit(original_data)
nrow(original_data) - nrow(data)
## [1] 1289
data = data[-1] #exclude country from dataset
kable(t(do.call(cbind, lapply(data, summary))))
Min. 1st Qu. Median Mean 3rd Qu. Max.
Year 2000.000 2005.000 2.008e+03 2.008e+03 2.011e+03 2.015e+03
Status 242.000 1407.000 2.420e+02 1.407e+03 2.420e+02 1.407e+03
Life.expectancy 44.000 64.400 7.170e+01 6.930e+01 7.500e+01 8.900e+01
Adult.Mortality 1.000 77.000 1.480e+02 1.682e+02 2.270e+02 7.230e+02
infant.deaths 0.000 1.000 3.000e+00 3.255e+01 2.200e+01 1.600e+03
Alcohol 0.010 0.810 3.790e+00 4.533e+00 7.340e+00 1.787e+01
percentage.expenditure 0.000 37.439 1.451e+02 6.990e+02 5.094e+02 1.896e+04
Hepatitis.B 2.000 74.000 8.900e+01 7.922e+01 9.600e+01 9.900e+01
Measles 0.000 0.000 1.500e+01 2.224e+03 3.730e+02 1.314e+05
BMI 2.000 19.500 4.370e+01 3.813e+01 5.580e+01 7.710e+01
under.five.deaths 0.000 1.000 4.000e+00 4.422e+01 2.900e+01 2.100e+03
Polio 3.000 81.000 9.300e+01 8.356e+01 9.700e+01 9.900e+01
Total.expenditure 0.740 4.410 5.840e+00 5.956e+00 7.470e+00 1.439e+01
Diphtheria 2.000 82.000 9.200e+01 8.416e+01 9.700e+01 9.900e+01
HIV.AIDS 0.100 0.100 1.000e-01 1.984e+00 7.000e-01 5.060e+01
GDP 1.681 462.150 1.593e+03 5.566e+03 4.719e+03 1.192e+05
Population 34.000 191897.000 1.420e+06 1.465e+07 7.659e+06 1.294e+09
thinness..1.19.years 0.100 1.600 3.000e+00 4.851e+00 7.100e+00 2.720e+01
thinness.5.9.years 0.100 1.700 3.200e+00 4.908e+00 7.100e+00 2.820e+01
Income.composition.of.resources 0.000 0.509 6.730e-01 6.316e-01 7.510e-01 9.360e-01
Schooling 4.200 10.300 1.230e+01 1.212e+01 1.400e+01 2.070e+01

Looking at the summary data we can already see some inconsistencies. In Infant Deaths we see that the max value listed is 1600 which doesn’t make sense since we’re working with per 1000 population data. The same or similar numbers we can see for Infant deaths, Measles, Under five deaths

boxplot(data$Adult.Mortality)

boxplot(data$infant.deaths)

boxplot(data$Measles)

boxplot(data$under.five.deaths)

data = data[
  data$Adult.Mortality < boxplot(data$Adult.Mortality, plot = FALSE)$stats[5] &
    data$infant.deaths < boxplot(data$infant.deaths, plot = FALSE)$stats[5] &
    data$Measles < boxplot(data$Measles, plot = FALSE)$stats[5] &
    data$under.five.deaths < boxplot(data$under.five.deaths, plot = FALSE)$stats[5]
  , ]
nrow(data)
## [1] 1223

This original dataset contains Country variable which we would not use for model building. Therefore, this column is excluded from the following data analysis as well as the records with NA in some of the columns mentioned above. This will reduce the size of the original dataset from 2938 to 1649 rows, which captures majority of the information and allows the speed of modeling to be more efficient.

Data Visualization

Let’s take a look on histograms for each attribute in the dataset.

ggplot(gather(data[-2]), aes(value)) +
    geom_histogram(bins = 10) + 
    facet_wrap(~key, scales = 'free_x')

Let’s take a look of the dataset using plots of Life Expectancy vs Status or Year. The boxplot indicates that there is significant difference in Life Expectancy between the Developed and Developing countries. As expected, the Life Expectancy increases as the Year passes by and the Violin plot shows that the data is well distributed across different years.

par(mfrow = c(1, 2))
# Histogram of Life Expectancy
hist(
  data$Life.expectancy,
  xlab = "Life Expectancy",
  main = "Distribution of Life Expectancy",
  col = "dodgerblue",
  breaks = 25
)
# Boxplot of Life Expectancy vs Country Status (Developed vs Developing)
plot(
  data$Status,
  data$Life.expectancy,
  xlab = "Status",
  ylab = "Life Expectancy",
  main = "Life Expectancy vs. Status",
  col = c(2, 3)
)

# Violin plot of Life Expectancy vs. Year
data %>% ggplot() + geom_violin(aes(
  x = Year,
  y = Life.expectancy,
  group = Year,
  fill = Year
)) + labs(title = "Life Expectancy vs. Year")

Colinearity

Colinearity issue is visualized using the following plots. Resulsts show that some predictors have strong collinearity issues, such as infant.deaths vs. under.five.deaths, GDP vs. percentage.expenditure, Population vs. thinness..1.19.years etc.

corrplot(
  cor(data[-c(2)]),
  method = "color",
  order = "hclust",
  type = "lower",
  outline = TRUE,
  tl.col = "black",
  addCoef.col = "darkgreen",
)

ggpairs(data[-c(2)])

Train test split

# train test split 70/30 hold out
train_size = floor(0.7 * nrow(data))
train_idx = sample(nrow(data), train_size)
data.train = data[train_idx, ]
data.test = data[-train_idx, ]

Model Building

Steps to generate best model:

We will start with simple additive model and then use AIC/ BIC to weed out unnecessary predictors. We then, may try out interactions and see if that improves our model and later, we may use response or, predictor transformations to improve our selected model.

Additve Model with all predictors

model.additive = lm(Life.expectancy ~ ., data.train)
#summary(model.additive)
par(mfrow = c(2, 2))
plot(model.additive)

So, looking at Residual vs Fitted plot that we may have issue of linearity and equal-variance. Similary, we need to check for normality assumption as dictated by Normal Q-Q plot

Let’s check for Homoscedasticityand Normality assumptions.

show_metrics(model.additive)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
0.00579662413223928 0.00170939760752026 0.794723028436924 3.26990415988085 10.1303636607145 11.3823057358887

So, from above data it looks liek although Adjusted R^2 and RMSE is better, but, we REJECT both Homoscedasticityand Normality assumptions. Also, from summary of model it seems there is need to prune out some predictors.

Let’s apply AIC and BIC to see if we can improve our model

model.additive.step.aic = step(model.additive, trace = FALSE)

par(mfrow = c(2, 2))
plot(model.additive.step.aic)

So, looking at AIC model, we can see that GDPvs percentage.expenditure collinearity issue has been removed by AIC. Also, some of the non-significant predictors are also removed.

show_metrics(model.additive.step.aic)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
0.00525633829895563 0.00166093453444593 0.795390043832459 3.25298734166675 10.194188632922 11.5230756693097

Apart from reduction in predictors, there isn’t a significant improvement. Let’s try out BIC:

model.additive.step.bic = step(model.additive, k = log(nrow(data.train)), trace = FALSE)

par(mfrow = c(2, 2))
plot(model.additive.step.bic)

show_metrics(model.additive.step.bic)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
0.00144111966160306 0.00567905041100767 0.792977603491778 3.26335284039806 10.3510885367102 11.5702143134234

So, BIC and AIC resutls are almost same and so does the evaluatin metric but, we saw two improvements: (1) Reduction in number of predictors (2) Reduction in influential points as per the Residual vs Leverage graph.

Hence, let’s select AIC model and try interactions. The BIC is little aggressive in removing predictors so, we are selecting AIC additive model.

Also, let’s check for variance-inflation-factor

vif_aic_all_model = vif(model.additive.step.aic)
vif_aic_all_model[vif_aic_all_model > 5]
##     infant.deaths under.five.deaths 
##             58.33             60.52

As evident from above VIF computation, infant.deaths and under.five.deaths in the model are causing collinearity issue. This same issue we found during our correlation plot. Let’s try to improve the model and remove issues which are violating linearity assumptions.

Applying Two-Way Interaction:

model.interaction = lm(Life.expectancy ~ . ^ 2, data = data.train)

Since, this will explode the number of predictors let’s generate AIC and BIC models for this interaction model.

AIC Interaction Model:

model.interaction.step.aic = step(model.interaction, trace = FALSE)

par(mfrow = c(2, 2))
plot(model.interaction.step.aic)
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

show_metrics(model.interaction.step.aic)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
0.0000525815047053244 3.33889417172493e-07 0.886282746910819 2.84634121890066 5.05408486008749 12.6735331315119

AIC and BIC model are again somewhat similar. Total numbr of predictors used by AIC is 119 and we can see that with interactions we have made improvement in adjusted r_squared and bptest. Let’s see the VIF imapct

vif_interaction_model = vif(model.interaction.step.aic)
vif_interaction_model[vif_interaction_model > 5]
##                                                   Year 
##                                              7.290e+01 
##                                       StatusDeveloping 
##                                              1.013e+03 
##                                        Adult.Mortality 
##                                              5.244e+05 
##                                          infant.deaths 
##                                              5.502e+03 
##                                                Alcohol 
##                                              3.834e+02 
##                                 percentage.expenditure 
##                                              7.593e+06 
##                                            Hepatitis.B 
##                                              1.246e+02 
##                                                Measles 
##                                              8.896e+01 
##                                                    BMI 
##                                              1.448e+02 
##                                      under.five.deaths 
##                                              7.173e+05 
##                                                  Polio 
##                                              2.944e+01 
##                                      Total.expenditure 
##                                              1.423e+02 
##                                             Diphtheria 
##                                              2.881e+01 
##                                               HIV.AIDS 
##                                              8.311e+02 
##                                                    GDP 
##                                              7.841e+06 
##                                             Population 
##                                              4.903e+05 
##                                   thinness..1.19.years 
##                                              7.891e+07 
##                                     thinness.5.9.years 
##                                              8.197e+07 
##                        Income.composition.of.resources 
##                                              8.927e+05 
##                                              Schooling 
##                                              1.209e+02 
##                                   Year:Adult.Mortality 
##                                              5.276e+05 
##                            Year:percentage.expenditure 
##                                              7.691e+06 
##                                 Year:under.five.deaths 
##                                              7.067e+05 
##                                               Year:GDP 
##                                              7.962e+06 
##                                        Year:Population 
##                                              4.937e+05 
##                              Year:thinness..1.19.years 
##                                              8.024e+07 
##                                Year:thinness.5.9.years 
##                                              8.338e+07 
##                   Year:Income.composition.of.resources 
##                                              8.956e+05 
##                       StatusDeveloping:Adult.Mortality 
##                                              4.983e+01 
##                               StatusDeveloping:Alcohol 
##                                              1.956e+01 
##                           StatusDeveloping:Hepatitis.B 
##                                              4.447e+01 
##                     StatusDeveloping:under.five.deaths 
##                                              1.208e+03 
##                     StatusDeveloping:Total.expenditure 
##                                              2.999e+01 
##                    StatusDeveloping:thinness.5.9.years 
##                                              3.545e+02 
##       StatusDeveloping:Income.composition.of.resources 
##                                              6.290e+02 
##                          Adult.Mortality:infant.deaths 
##                                              9.013e+02 
##                                Adult.Mortality:Alcohol 
##                                              1.432e+01 
##                 Adult.Mortality:percentage.expenditure 
##                                              1.325e+01 
##                            Adult.Mortality:Hepatitis.B 
##                                              2.113e+01 
##                                Adult.Mortality:Measles 
##                                              8.447e+00 
##                                    Adult.Mortality:BMI 
##                                              1.420e+01 
##                      Adult.Mortality:under.five.deaths 
##                                              8.973e+02 
##                             Adult.Mortality:Population 
##                                              1.095e+01 
##                     Adult.Mortality:thinness.5.9.years 
##                                              3.221e+01 
##                              Adult.Mortality:Schooling 
##                                              7.211e+01 
##                        infant.deaths:Total.expenditure 
##                                              1.263e+03 
##                               infant.deaths:Diphtheria 
##                                              2.103e+03 
##          infant.deaths:Income.composition.of.resources 
##                                              3.992e+03 
##                                        Alcohol:Measles 
##                                              8.607e+00 
##                              Alcohol:under.five.deaths 
##                                              1.024e+01 
##                              Alcohol:Total.expenditure 
##                                              3.862e+01 
##                                       Alcohol:HIV.AIDS 
##                                              3.750e+01 
##                                            Alcohol:GDP 
##                                              4.016e+01 
##                                     Alcohol:Population 
##                                              2.054e+01 
##                           Alcohol:thinness..1.19.years 
##                                              9.749e+02 
##                             Alcohol:thinness.5.9.years 
##                                              1.015e+03 
##                Alcohol:Income.composition.of.resources 
##                                              2.703e+02 
##                                      Alcohol:Schooling 
##                                              2.089e+02 
##                     percentage.expenditure:Hepatitis.B 
##                                              5.239e+02 
##               percentage.expenditure:Total.expenditure 
##                                              1.793e+01 
##                      percentage.expenditure:Population 
##                                              1.216e+02 
## percentage.expenditure:Income.composition.of.resources 
##                                              3.259e+03 
##                          Hepatitis.B:Total.expenditure 
##                                              2.868e+01 
##                                        Hepatitis.B:GDP 
##                                              6.632e+02 
##            Hepatitis.B:Income.composition.of.resources 
##                                              1.435e+02 
##                                  Hepatitis.B:Schooling 
##                                              1.308e+02 
##                                          Measles:Polio 
##                                              5.067e+01 
##                                       Measles:HIV.AIDS 
##                                              6.416e+00 
##                           Measles:thinness..1.19.years 
##                                              4.360e+03 
##                             Measles:thinness.5.9.years 
##                                              4.253e+03 
##                Measles:Income.composition.of.resources 
##                                              3.503e+01 
##                                         BMI:Diphtheria 
##                                              6.419e+01 
##                                                BMI:GDP 
##                                              1.042e+01 
##                    BMI:Income.composition.of.resources 
##                                              7.738e+01 
##                                          BMI:Schooling 
##                                              1.582e+02 
##                                under.five.deaths:Polio 
##                                              4.804e+01 
##                    under.five.deaths:Total.expenditure 
##                                              1.184e+03 
##                           under.five.deaths:Diphtheria 
##                                              1.798e+03 
##                 under.five.deaths:thinness..1.19.years 
##                                              2.732e+03 
##                   under.five.deaths:thinness.5.9.years 
##                                              2.539e+03 
##      under.five.deaths:Income.composition.of.resources 
##                                              3.627e+03 
##                            under.five.deaths:Schooling 
##                                              2.231e+02 
##                  Polio:Income.composition.of.resources 
##                                              7.438e+01 
##                             Total.expenditure:HIV.AIDS 
##                                              6.454e+01 
##                 Total.expenditure:thinness..1.19.years 
##                                              1.005e+03 
##                   Total.expenditure:thinness.5.9.years 
##                                              9.961e+02 
##      Total.expenditure:Income.composition.of.resources 
##                                              7.068e+01 
##                            Total.expenditure:Schooling 
##                                              1.691e+02 
##                                    Diphtheria:HIV.AIDS 
##                                              5.251e+01 
##                                  Diphtheria:Population 
##                                              6.519e+01 
##                        Diphtheria:thinness..1.19.years 
##                                              8.311e+01 
##                                    HIV.AIDS:Population 
##                                              5.294e+00 
##                          HIV.AIDS:thinness..1.19.years 
##                                              6.761e+03 
##                            HIV.AIDS:thinness.5.9.years 
##                                              6.933e+03 
##                                     HIV.AIDS:Schooling 
##                                              3.475e+02 
##                                         GDP:Population 
##                                              1.209e+02 
##                                 GDP:thinness.5.9.years 
##                                              6.993e+00 
##                    GDP:Income.composition.of.resources 
##                                              3.210e+03 
##             Population:Income.composition.of.resources 
##                                              4.038e+02 
##                thinness..1.19.years:thinness.5.9.years 
##                                              2.620e+01 
##                         thinness..1.19.years:Schooling 
##                                              1.888e+04 
##                           thinness.5.9.years:Schooling 
##                                              1.926e+04

This interaction model is getting overly complicated. Let’s inspect pairs scatter plot.

pairs(data.frame(data$Life.expectancy, data$Year, data$Status, data$Adult.Mortality, data$infant.deaths, data$Alcohol, data$percentage.expenditure, data$BMI, data$under.five.deaths, data$Polio, data$Total.expenditure, data$HIV.AIDS, data$thinness.5.9.years, data$Income.composition.of.resources, data$Schooling))

Based on this cross-predictor plots, we confirm that (a) HIV.AIDS has non-linear relationship with Life.expectancy' and, (B)HIV.AIDSandAdult.Mortalityare also interacting non-linearly. Also, we may condsider removingYear` because it may divert model in different direction because of its chaotic relationship with the response.

Check for impact of high VIF predictors.

model_selected_wo_underfive = lm(
  Life.expectancy ~ Year + Status + Adult.Mortality + infant.deaths + Alcohol + percentage.expenditure + BMI + Polio + Total.expenditure + HIV.AIDS + thinness.5.9.years + Income.composition.of.resources + Schooling,
  data = data.train
)

model_selected_wo_infant.deaths = lm(
  Life.expectancy ~ Year + Status + Adult.Mortality + under.five.deaths + Alcohol + percentage.expenditure + BMI + Polio + Total.expenditure + HIV.AIDS + thinness.5.9.years + Income.composition.of.resources + Schooling,
  data = data.train
)

model_selected_underfive = lm(
  under.five.deaths ~ Year + Status + Adult.Mortality + infant.deaths + Alcohol + percentage.expenditure + BMI  + Polio + Total.expenditure + HIV.AIDS + thinness.5.9.years + Income.composition.of.resources + Schooling,
  data = data.train
)

model_selected_infant.deaths = lm(
  infant.deaths ~ Year + Status + Adult.Mortality + under.five.deaths + Alcohol + percentage.expenditure + BMI + Polio + Total.expenditure + HIV.AIDS + thinness.5.9.years + Income.composition.of.resources + Schooling,
  data = data.train
)

cor(resid(model_selected_underfive),
    resid(model_selected_wo_underfive))
## [1] -0.2811
cor(resid(model_selected_infant.deaths),
    resid(model_selected_wo_infant.deaths))
## [1] 0.2489

Based on this we can get rid of under.five.deaths

model_selected = lm(
  Life.expectancy ~ Year + Status + Adult.Mortality + infant.deaths + Alcohol + percentage.expenditure + BMI   + Polio + Total.expenditure + HIV.AIDS + thinness.5.9.years + Income.composition.of.resources + Schooling ,
  data = data.train
)
par(mfrow = c(2, 2))
plot(model_selected)

show_metrics(model_selected)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
7.98172922613717e-09 0.00193882463659615 0.777242563266194 3.40487072497974 11.0851769964661 12.3016534609215
vif(model_selected)
##                            Year                StatusDeveloping 
##                           1.148                           1.797 
##                 Adult.Mortality                   infant.deaths 
##                           1.425                           1.219 
##                         Alcohol          percentage.expenditure 
##                           2.335                           1.361 
##                             BMI                           Polio 
##                           1.651                           1.078 
##               Total.expenditure                        HIV.AIDS 
##                           1.109                           1.112 
##              thinness.5.9.years Income.composition.of.resources 
##                           1.763                           2.495 
##                       Schooling 
##                           2.952

Looks like none of our predictor is causing inflation in variance. Let’s consider interaction of these predictors.

model_selected_interactions = lm(
  Life.expectancy ~ (
    Year + Status + Adult.Mortality + infant.deaths + Alcohol + percentage.expenditure + BMI   + Polio + Total.expenditure + HIV.AIDS + thinness.5.9.years + Income.composition.of.resources + Schooling
  ) ^ 2 ,
  data = data.train
)
par(mfrow = c(2, 2))
plot(model_selected_interactions)

show_metrics(model_selected_interactions)
## Warning in predict.lm(model, data.train): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(model, data.test): prediction from a rank-deficient fit
## may be misleading
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
0.000195608009256019 6.0015044267283e-09 0.864843837403561 2.93777091664604 6.11076547589384 9.8909471303747
model_selected_interactions_aic = step(model_selected_interactions, trace = FALSE)
#summary(model_selected_interactions_aic)
show_metrics(model_selected_interactions_aic)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
0.000211569371785971 1.65193456244177e-08 0.868235728505457 2.74564358446409 6.19882051252615 9.11938755991418
model_selected_interactions_bic=step(model_selected_interactions, trace=0, k = log(nrow(data.train)))
#summary(model_selected_interactions_bic)

show_metrics(model_selected_interactions_bic)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
0.193951689327949 1.5500162773493e-07 0.85838311309652 2.75908057402614 6.86320899781926 8.76115213179641
par(mfrow=c(2,2))
plot(model_selected_interactions_bic)

model_selected_poly_interactions=lm(Life.expectancy ~ Year + Status + Adult.Mortality + infant.deaths + Alcohol + percentage.expenditure + BMI   + Polio + Total.expenditure + HIV.AIDS + thinness.5.9.years + Income.composition.of.resources + Schooling + infant.deaths:Polio +  Adult.Mortality:HIV.AIDS + I(HIV.AIDS^2) , data=data.train)
par(mfrow = c(2, 2))
plot(model_selected_poly_interactions)

show_metrics(model_selected_poly_interactions)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
0.000277590890442672 0.000138535706774098 0.787761353796874 3.37355783681493 10.5240949120107 11.1480832539975
model_selected_poly_interactions=lm(Life.expectancy ~ Year + Status + Adult.Mortality + infant.deaths + Alcohol + percentage.expenditure + BMI   + Polio + Total.expenditure + HIV.AIDS + thinness.5.9.years + Income.composition.of.resources + Schooling + Adult.Mortality:HIV.AIDS +I(HIV.AIDS^2) , data=data.train)
par(mfrow = c(2, 2))
plot(model_selected_poly_interactions)

show_metrics(model_selected_poly_interactions)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
0.000156999247869917 0.000738571059426802 0.786944447652274 3.37396800603683 10.5771940309185 10.8899030481685
vif_poly = vif(model_selected_poly_interactions)
vif_poly[vif_poly > 5]
##      HIV.AIDS I(HIV.AIDS^2) 
##         17.90         11.57
length(coef(model_selected_poly_interactions))
## [1] 16
model_selected_poly_interactions=lm(Life.expectancy ~ Year + Status + Adult.Mortality + infant.deaths + Alcohol + percentage.expenditure + BMI   + Polio + Total.expenditure + HIV.AIDS + thinness.5.9.years + Income.composition.of.resources + Schooling + Adult.Mortality:HIV.AIDS +I(HIV.AIDS^2) , data=data.train)
model_selected_poly_interactions_aic=step(model_selected_poly_interactions, trace=0, k=log(nrow(data)))
par(mfrow = c(2, 2))
plot(model_selected_poly_interactions_aic)

show_metrics(model_selected_poly_interactions_aic)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
0.0000710282531229792 0.00386349710523811 0.785613734213058 3.37519650283138 10.7066103429102 11.2057617015114
vif_poly = vif(model_selected_poly_interactions_aic)
vif_poly[vif_poly > 5]
##      HIV.AIDS I(HIV.AIDS^2) 
##         17.22         11.22
length(coef(model_selected_poly_interactions_aic))
## [1] 11
model_selected_outliers = lm(
  Life.expectancy ~  Adult.Mortality  +
    infant.deaths + Polio + HIV.AIDS + thinness.5.9.years + Income.composition.of.resources  +
    Schooling  + Adult.Mortality:HIV.AIDS + I(HIV.AIDS ^ 2),
  data = data.train,
  subset = abs(rstandard(model_selected_poly_interactions_aic)) < 2
)

par(mfrow = c(2, 2))
plot(model_selected_outliers)

show_metrics(model_selected_outliers)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
0.0000168571821545005 0.284757727371997 0.834961405242957 2.84052295459347 11.7629907297865 12.5618951657768

Based on above plots, we can say that we have achieved a good model based with 10 predictors and the formula is:

formula = Life.expectancy ~ Adult.Mortality + infant.deaths + Polio + HIV.AIDS + thinness.5.9.years + Income.composition.of.resources + Schooling + Adult.Mortality:HIV.AIDS + I(HIV.AIDS^2)

Also, looking at the model it seems, we are satisfying assumptions of linear model.

Evaluation metric are as follows:

show_metrics(model_selected_outliers)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
0.0000168571821545005 0.284757727371997 0.834961405242957 2.84052295459347 11.7629907297865 12.5618951657768

Box-Cox confirms that it is not necessary to transform y variable.

# Box-Cox Transformation
par(mfrow = c(1, 2))
boxcox(model_selected_outliers, plotit = TRUE)
boxcox(model_selected_outliers,
       plotit = TRUE,
       lambda = seq(0.5, 1.5, by = 0.1))

Explore the relationship between Life.expectancy and each predictor.

plot(Life.expectancy ~ Adult.Mortality, data = data.train, col = "grey", pch = 20, cex = 1.5,
main = "Life.expectancy vs Adult.Mortality")

plot(Life.expectancy ~ infant.deaths, data = data.train, col = "grey", pch = 20, cex = 1.5,
main = "Life.expectancy vs infant.deaths")

plot(Life.expectancy ~ Polio, data = data.train, col = "grey", pch = 20, cex = 1.5,
main = "Life.expectancy vs Polio")

plot(Life.expectancy ~ HIV.AIDS, data = data.train, col = "grey", pch = 20, cex = 1.5,
main = "Life.expectancy vs HIV.AIDS")

plot(Life.expectancy ~ thinness.5.9.years, data = data.train, col = "grey", pch = 20, cex = 1.5,
main = "Life.expectancy vs thinness.5.9.years")

plot(Life.expectancy ~ Income.composition.of.resources, data = data.train, col = "grey", pch = 20, cex = 1.5,
main = "Life.expectancy vs Income.composition.of.resources")

plot(Life.expectancy ~ Schooling, data = data.train, col = "grey", pch = 20, cex = 1.5,
main = "Life.expectancy vs Schooling")

lm_le_am = lm(Life.expectancy ~ Adult.Mortality, data = data.train)
show_metrics(lm_le_am)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
1.31946975560735e-13 2.0848220719532e-20 0.381979209365818 5.60059571099637 31.1931536544079 43.7833555785532
lm_le_am_log = lm(Life.expectancy ~ log(Adult.Mortality), data = data.train)
show_metrics(lm_le_am_log)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
0.00765029182472552 1.94261036561361e-11 0.126165812914956 6.65965537964702 44.104736409026 54.7316397611221
# bp.test result is significantly improved with log transformation of Adult.Mortality.
lm_le_am_poly = lm(Life.expectancy ~ Adult.Mortality + I(Adult.Mortality ^ 2), data = data.train)
show_metrics(lm_le_am_poly)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
3.6254656266214e-17 4.17249954864513e-19 0.49425651121724 5.07252530413321 25.4963293100404 40.8071731000647
# Polynomial transformation doesn't work.
lm_le_id = lm(Life.expectancy ~ infant.deaths, data = data.train)
show_metrics(lm_le_id)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
7.31924441589467e-12 0.000156447375358829 0.184186924009964 6.43969288113596 41.1762565568694 50.9169461839039
# lm_le_id_log = lm(Life.expectancy ~ log(infant.deaths), data = data.train)
# couldn't perform log transformation because of O values.

lm_le_id_poly = lm(Life.expectancy ~ infant.deaths + I(infant.deaths ^ 2), data = data.train)
show_metrics(lm_le_id_poly)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
0.000508653204248575 0.00323441330816766 0.23744401022677 6.22439700666001 38.4431615311481 47.7457295194962
# Polynomial transformation of infant.deaths doesn't improve.
lm_le_po = lm(Life.expectancy ~ Polio, data = data.train)
show_metrics(lm_le_po)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
0.0003241416647168 6.92392393360178e-09 0.0313639370164055 7.01114361794841 48.8896393223974 57.6167964758711
lm_le_po_log = lm(Life.expectancy ~ log(Polio), data = data.train)
show_metrics(lm_le_po_log)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
0.12016931659563 4.23545456651714e-10 0.00884845631969655 7.09007148873979 50.0260555394871 59.6827367520386
# bp.test result is improved with log transformation of Polio.
lm_le_po_poly = lm(Life.expectancy ~ Polio + I(Polio ^ 2), data = data.train)
show_metrics(lm_le_po_poly)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
0.00158567015839279 2.24299062264698e-06 0.108777167255176 6.72829540560709 44.9297150359352 52.7376816476337
# Polynomial transformation also helps.
lm_le_hiv = lm(Life.expectancy ~ HIV.AIDS, data = data.train)
show_metrics(lm_le_hiv)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
4.03952821856978e-36 0.0000198012706842361 0.209929036050984 6.4313166566686 39.8769836708222 45.5161980673164
lm_le_hiv_log = lm(Life.expectancy ~ log(HIV.AIDS), data = data.train)
show_metrics(lm_le_hiv_log)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
0.552652455526738 0.0137724327525417 0.502504519095959 5.02244483703061 25.1099205939158 26.6822419695744
# bp.test result is significantly improved with log transformation of HIV.AIDS.
lm_le_hiv_poly = lm(Life.expectancy ~ HIV.AIDS + I(HIV.AIDS ^ 2), data = data.train)
show_metrics(lm_le_hiv_poly)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
2.6142281531311e-10 0.000196555279081974 0.34914503706125 5.92763237188043 32.8119152025081 38.699017037271
# Polynomial transformation also helps.
lm_le_hiv_log_poly = lm(Life.expectancy ~ log(HIV.AIDS) + I(HIV.AIDS ^ 2), data = data.train) # doesn't further improve
show_metrics(lm_le_hiv_log_poly)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
0.562882116468676 0.0153407307548998 0.502270186543973 5.02228094118412 25.0923314145739 26.5322220039606
lm_le_thn = lm(Life.expectancy ~ thinness.5.9.years, data = data.train)
show_metrics(lm_le_thn)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
6.14282579948944e-28 0.0000221260079069445 0.263294723235464 6.12094400142273 37.1834754499896 45.1770655485404
lm_le_thn_log = lm(Life.expectancy ~ log(thinness.5.9.years), data = data.train)
show_metrics(lm_le_thn_log)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
0.0545453516781846 1.51200791992227e-14 0.305222620082861 5.94407704583064 35.0672629396873 44.3911102756058
# bp.test result is significantly improved with log transformation.
lm_le_thn_poly = lm(Life.expectancy ~ thinness.5.9.years + I(thinness.5.9.years ^ 2),
                    data = data.train)
show_metrics(lm_le_thn_poly)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
2.80632932005911e-11 3.62577805189945e-06 0.422194648972061 5.42774268112917 29.1292242681542 37.4516765358592
# Polynomial transformation doesn't help a lot.
lm_le_icr = lm(Life.expectancy ~ Income.composition.of.resources, data = data.train)
show_metrics(lm_le_icr)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
7.37781511376016e-54 2.33428825623606e-13 0.461835527231953 5.24157184605309 27.1625928201721 29.4738601564873
# lm_le_icr_log = lm(Life.expectancy ~ log(Income.composition.of.resources), data = data.train) couldn't perform log tranformation due to 0 values.

# bp.test result is significantly improved with log transformation.
lm_le_icr_poly = lm(
  Life.expectancy ~ Income.composition.of.resources + I(Income.composition.of.resources ^ 2),
  data = data.train
)
show_metrics(lm_le_icr_poly)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
5.4612618030595e-13 9.09702150976323e-07 0.677278836238755 4.05060304703991 16.2695224932704 17.6472405364083
# Polynomial transformation helps a little.
lm_le_sch = lm(Life.expectancy ~ Schooling, data = data.train)
show_metrics(lm_le_sch)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
0.0000304706559981915 1.07400645161098e-07 0.537674227027625 4.84287420568507 23.3348118595196 27.6627365688521
lm_le_sch_log = lm(Life.expectancy ~ log(Schooling), data = data.train)
show_metrics(lm_le_sch_log)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
1.00148928652513e-12 3.03810787691108e-07 0.519227419342565 4.942334569492 24.2658713243039 29.1745366043211
# log transformation doesn't work.
lm_le_sch_poly = lm(Life.expectancy ~ Schooling + I(Schooling ^ 2), data = data.train)
show_metrics(lm_le_sch_poly)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
2.25812155214884e-06 2.32723693100681e-07 0.539000865569827 4.83978457056023 23.240607153173 27.6900962712899
# Polynomial transformation doesn't help very much.
# New model based on the transformation result
model_trans = lm(Life.expectancy ~ log(Adult.Mortality) + infant.deaths + Polio  + I(Polio ^ 2) + HIV.AIDS + I(HIV.AIDS^2) + log(thinness.5.9.years) + Income.composition.of.resources + Schooling + Adult.Mortality:HIV.AIDS, data = data.train)
show_metrics(model_selected_outliers)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
0.0000168571821545005 0.284757727371997 0.834961405242957 2.84052295459347 11.7629907297865 12.5618951657768
show_metrics(model_trans)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
0.000073871493434572 0.000773165514142332 0.745553260163814 3.64968393207418 12.7072603576073 12.7752016812164
# After transformation, it shows that bptest and shapiro test results improves a little.

Results

The best model stastified all LINE assumptions. To build the model we performed data analysis where we removed all records with NA values and outliers. We looked at correlation matrix between predictors and excluded highly correlated predictors. We started from simple additive model then we tried to search better model using backward search using both AIC and BIC evaluation criteria. Next we tried to apply 2 level interactions for every predictor and perform search again. We pick the best selected model and started manually explore different interactions and transformations.

best_model = model_selected_interactions_bic
par(mfrow = c(2, 2))
plot(best_model)

show_metrics(best_model)
bptest shapiro_test adj_r2 LOOCV TRAIN_RMSE TEST_RMSE
0.193951689327949 1.5500162773493e-07 0.85838311309652 2.75908057402614 6.86320899781926 8.76115213179641